Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add of Online Hierarchical Clustering #1218

Open
wants to merge 47 commits into
base: main
Choose a base branch
from
Open

Conversation

kchardon
Copy link

No description provided.

@kchardon kchardon marked this pull request as draft April 11, 2023 11:55
@kchardon kchardon marked this pull request as ready for review April 11, 2023 12:26
@MaxHalford
Copy link
Member

Hey there! I hope it's ok for me to answer only by now.

Am I correct in assuming the algorithm stores all the data points it sees in memory (i.e. the X attribute)?

@kchardon
Copy link
Author

kchardon commented May 22, 2023

Hey there! I hope it's ok for me to answer only by now.

Am I correct in assuming the algorithm stores all the data points it sees in memory (i.e. the X attribute)?

Hello, sorry for replying that late.
I use the window_size attribute and when there are more data than allowed, it deletes the oldest data point.
If window_size < 1, then it stores all the data points

@MaxHalford
Copy link
Member

If window_size < 1, then it stores all the data points

Ok I see, fair. But I don't think we'll ever want that behavior. Could you remove it?

@kchardon
Copy link
Author

If window_size < 1, then it stores all the data points

Ok I see, fair. But I don't think we'll ever want that behavior. Could you remove it?

Yes I can. So I add an error if the value of window_size is not an integer > 0 ?

@MaxHalford
Copy link
Member

Nope, no need to check for an error. An exception will raise itself at some point. In general, we don't do input validation. Instead, we document well.

@kchardon
Copy link
Author

Okay ! I will delete it

@hoanganhngo610
Copy link
Contributor

@MaxHalford I think I'm quite happy with the current state of the code! Only one small concern is that the unit tests keep failing because there is an attribute error, saying that HierarchicalClustering does not have the attribute distance_func, although I believe atm there is no distance_func in my code anywhere anymore. Do you think of any potential reasons behind this?

@hoanganhngo610
Copy link
Contributor

@MaxHalford @kchardon If possible, I would really hope that both of you would be able to take a final look at the current state of the implementation and see if there are any changes you want to make. In case there is no opposition from both of you, when applicable, I would want to merge this into the main branch of River.

@MaxHalford
Copy link
Member

I will review next week :)

@MaxHalford
Copy link
Member

I (finally) took a look at this. There's a lot of nits to fix. But the main issue I see is that the code relies on numpy. It converts input dictionaries to numpy arrays. Would it be possible to use dictionaries only instead? I don't see any good reason to rely on numpy. Indeed, it's not in the spirit of River to rely on numpy when it can be avoided.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants